docs for graphs that depend on assets #12597
Left a few comments, but looks good!
docs/content/guides/dagster/how-assets-relate-to-ops-and-graphs.mdx
@@ -262,6 +263,36 @@ Note that in most cases, it is usually possible to pass some data dependency. In

Dagster also provides more advanced abstractions to handle dependencies and IO. If you find that you are finding it difficult to model data dependencies when using external storage, check out [IO managers](/concepts/io-management/io-managers).

### Loading an asset as an input

You can supply an asset as an input to one of the ops in a graph. Dagster can then use the [IO manager](/concepts/io-management/io-managers) on the asset to load the input value for the op.
How do I override the IOManager used by the asset? For assets, I can do that with `@asset(ins={"key": AssetIn(..., input_manager_key="overriding_io_mgr")})`. How do I do it with ops?
Oh, I didn't put together that this is an "unconnected input" but I guess that makes sense, OK.
If the asset is partitioned, then:

- If the job is partitioned, the corresponding partition of the asset will be loaded.
- If the job is not partitioned, then all partitions of the asset will be loaded.
What does "all partitions will be loaded" mean for the shape of the value? Is it a list, or a dictionary, or a generator, or something else? I'm wondering how I ought to write my op to handle that.
The type depends on the I/O manager implementation:
- The Pandas and PySpark type handlers of the DB IO managers (Snowflake, DuckDB, BigQuery) always return a single DataFrame, which can include values from all the partitions.
- When loading an input that corresponds to multiple partitions, the UPathIOManager returns a dictionary that maps each input partition key to the input value for that partition key.
This needs better docs, but I don't think this is the right place to put them.
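In the meantime, here is a rough sketch of how an op body could cope with either shape described above. It is plain Python and does not call any Dagster API; `iter_partition_values` and `count_rows` are hypothetical helper names made up for illustration, and the "a dict means a partition mapping" check is an assumption that breaks down if a single unpartitioned value is itself a dict.

```python
from typing import Any, Iterator, Optional, Tuple


def iter_partition_values(loaded: Any) -> Iterator[Tuple[Optional[str], Any]]:
    """Yield (partition_key, value) pairs from whatever the I/O manager returned.

    Hypothetical helper, not part of Dagster.
    """
    if isinstance(loaded, dict):
        # Mapping shape: {partition_key: value}, as described for UPathIOManager.
        # Assumption: a dict always means "one entry per partition".
        for key, value in loaded.items():
            yield key, value
    else:
        # Single-object shape: one value that already covers every partition,
        # as described for the DB I/O managers. No partition key is available.
        yield None, loaded


def count_rows(loaded: Any) -> int:
    """Example op body logic: total length across all loaded partitions."""
    return sum(len(value) for _, value in iter_partition_values(loaded))
```

An op that receives the asset as its input could call `count_rows(emails)` (or loop over `iter_partition_values(emails)`) without caring which I/O manager produced the value.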
2f0f0b4 to a01726c
a01726c to 016d205
Summary & Motivation
Motivated by this feedback: https://dagster.slack.com/archives/C01U5LFUZJS/p1677548018200809
How I Tested These Changes